BMI is measured as an individual’s weight in kilograms divided by the individual’s height in meters squared. Different categories of weight are defined with the following cutoffs:
An article published in Nature evaluated and compared the Body-Mass Index (BMI) of populations in rural and urban communities around the world:
NCD Risk Factor Collaboration (NCD-RisC). Rising rural body-mass index is the main driver of the global obesity epidemic in adults. Nature 569, 260–264 (2019).
This article challenged the widely-held view that increased urbanization was one of the major reasons for increased global obesity rates. This view came about because many countries around the world have shown increased urbanization levels in parallel with increased obesity rates. However this study demonstrated that this might not be the case. This case study will evaluate the data reported in this article to explore regional and gender specific differences in the obesity rates around the world in 1985 and 2017. Most importantly we will test if there is a difference in obesity rates between rural and urban communities.
Our main questions are:
In this case study, we’ll walk you through importing data from a pdf, cleaning data, wrangling data, visualizing the data, and comparing the means of two groups using well-established and commonly used packages, including stringr, tidyr, dplyr, purrr, and ggplot2. We will especially focus on using packages and functions from the Tidyverse. The Tidyverse is a library of packages created by the chief scientist at RStudio, Hadley Wickham. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.
We will begin by loading the packages that we will need:
library(here)
library(pdftools)
library(stringr)
library(readr)
library(dplyr)
library(tibble)
library(magrittr)
library(purrr)
library(tidyr)
library(ggplot2)
library(data.table)
library(ggrepel)
library(cowplot)| Package | Use |
|---|---|
| here | to easily load and save data |
| pdftools | to read a pdf into R |
| stringr | to manipulate the text within the pdf of the data |
| readr | to manipulate the text within the pdf of the data into individual lines |
| dplyr | to arrange/filter/select subsets of the data |
| tibble | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr |
| magrittr | to use the %<>% pipping operator |
| purrr | to perform functions on all columns of a tibble |
| tidyr | to convert data from wide to long format |
| ggplot2 | to make visualizations with multiple layers |
| data.table | to create data objects that are similar to tibbles but different |
| ggrepel | to allow labels in figures not to overlap |
| cowplot | to allow plots to be combined |
The first time we use a function, we will use the :: to indicate what package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to illustrate what package we are using.
The measurement of BMI has some limitations that are well recognized, as it does not account for the composition of body mass, the location of body fat, or the contribution of body frame size. However, BMI has been a useful health indicator for risk for many diseases particularly when combined with other risk factor information.
We will be using data located within a table of the supplementary material for the NCD-RisC paper referenced above. This is a pdf that can be found freely available online.
Here you can see that the data contains mean BMI values for both Men and Women in various countries at the national level, as well as the mean BMI values for the rural and urban areas of these countries for both 1985 and 2017.
The data within the parentheses are the 95 % credible interval (CIs) ranges for the mean BMI estimates. The authors provide these CIs as a guide to understand how likely the estimate is for the true population mean BMI. A wider range suggests that the estimate is less accurate, as there are more possible values for the true mean with credible evidence.
Note: While gender and sex are not actually binary, the data presented that is used in this analysis only contains data for groups of individuals described as men or women.
First let’s download the data:
url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1171-x/MediaObjects/41586_2019_1171_MOESM1_ESM.pdf"
utils::download.file(url, "paper_supplement.pdf")Now that we have the obesity pdf, we will read it in to R using the pdftools package:
pdf_obesity<-pdftools::pdf_text(here("paper_supplement.pdf"))
# PDF error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array
# errors like this can be common with PDFs... we can take a look at the data to see if it still looks as expectedLet’s take a look at the data- the summary() function helps us to look at the structure of R objects.
Length Class Mode
63 character character
We can see that we have 63 elements that are character strings. You may also notice that the original PDF has 63 pages. Let’s take a look at some of these elements.
[1] "Letter https://doi.org/10.1038/s41586-019-1171-x\nRising rural body-mass index is the main driver of\nthe global obesity epidemic in adults\nNCD Risk Factor Collaboration (NCD-RisC)*\n*A list of authors and their affiliations appears in the online version of the paper.\n2 6 0 | N A T U RE | V O L 5 6 9 | 9 M A Y 2 0 1 9\n"
[1] "Supplementary Information. Statistical model for estimating BMI trends by rural and urban\nplace of residence.\n 1\n"
We can see that the output looks pretty similar to the pages of the pdf, but the spacing is quite awkward. The way the data is displayed is partially influenced by how width setting of the RStudio window.
We are interested in a supplementary Table 3. which has multiple pages and includes the same header on each page; we can use that to determine what elements of our pdf_obesity character strings include our table. We will use the str_detect() function of the stringr package to search for the elements that contain text that is consistently in the header. The output of of this function will show which elements of the object (in this case pages of the pdf) include this pattern indicated as a “TRUE” or “FALSE”.
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[56] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
# "Age-standardised mean BMI" is part of the header in the table on every page in the pdf.
# This shows the pages that contain our table of interest (as TRUE values).Let’s extract just the data for the table now and call it rural_urban.
rural_urban <-pdf_obesity[stringr::str_detect(pattern ="Age-standardised mean BMI", string =pdf_obesity)]
str(rural_urban) chr [1:10] " 2 "| __truncated__ ...
# The str() function is similar to the summary function
# This shows us that there are 10 pages worth of elements in our rural_urban objectLet’s check the first and last page:
cat(rural_urban[1])
# Using the cat() function which stands for Concatenate And Print
# This will allow for the "\n" values to be shown as spacesThis looks the same as the beginning… how about the end?
Great! Our rural_urban object looks like it contains the entire Supplementary 3 table, as both the beginning and the end include the data we expected.
At this point we have large strings now for each page of the table, but this is not very convenient to work with. Now we will wrangle the data into a more usable form. Ideally we would like our data to be in some sort of tabular form.
First it would be useful to separate each page into lines.
chr [1:461] " 2 "| __truncated__ ...
[1] "Afghanistan Women 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4)"
We also have a lot of white-space… let’s get rid of the excess white spaces using str_squish().
[1] "2 2"
[2] "Age-standardised mean BMI in 1985 (kg/m ) Age-standardised mean BMI in 2017 (kg/m )"
[3] "Country Sex"
[4] "National Rural Urban National Rural Urban"
[5] "Men 20.2 (17.8-22.7) 19.7 (17.2-22.2) 22.4 (20.0-25.0) 22.8 (20.3-25.3) 22.5 (20.0-25.0) 23.6 (21.0-26.1)"
[6] "Afghanistan Women 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4)"
Now it is much easier to see the data.
If we look at the end of the first page of the table and the start of the second we can see that the header information is repeated, as well as a line with the page number and an empty line, and a line that says “2 2”.
[1] "Benin Women 20.6 (19.1-22.0) 20.1 (18.5-21.6) 21.8 (20.3-23.3) 24.3 (23.5-25.0) 23.2 (22.5-24.0) 25.5 (24.7-26.3)"
[2] "Men 24.3 (21.8-26.8) na* 24.3 (21.8-26.8) 26.7 (24.3-29.2) na* 26.7 (24.3-29.2)"
[3] "Bermuda Women 25.4 (22.2-28.6) na* 25.4 (22.2-28.6) 28.5 (25.3-31.6) na* 28.5 (25.3-31.6)"
[4] "Men 20.6 (19.1-22.1) 20.3 (18.7-21.8) 23.1 (21.5-24.6) 23.5 (22.8-24.3) 23.1 (22.2-23.9) 24.2 (23.3-25.2)"
[5] "Bhutan Women 20.7 (18.4-22.9) 20.3 (18.0-22.6) 23.0 (20.8-25.3) 24.6 (23.7-25.5) 23.8 (22.7-24.9) 25.9 (24.7-27.0)"
[6] "51"
[7] ""
[8] "2 2"
[9] "Age-standardised mean BMI in 1985 (kg/m ) Age-standardised mean BMI in 2017 (kg/m )"
[10] "Country Sex"
[11] "National Rural Urban National Rural Urban"
[12] "Men 23.3 (21.0-25.5) 22.9 (20.6-25.2) 23.6 (21.3-25.9) 26.1 (23.9-28.4) 25.3 (23.1-27.5) 26.5 (24.2-28.8)"
[13] "Bolivia Women 23.8 (22.3-25.3) 23.0 (21.4-24.5) 24.6 (23.1-26.1) 27.9 (26.5-29.3) 27.0 (25.6-28.5) 28.3 (26.8-29.7)"
Although the header was necessary on all of the pages of the pdf version of the table, we only need that information once in our data.
So, let’s remove all the header information and the page number lines from the rural_urban object, then we will make a single line header for the beginning. One way to do this is to find all lines that include either “women” or “men” and only keep this data.
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
[18] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[35] 39 40 41 42 43 44 45 46 47 48 55 56 57 58 59 60 61
[52] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[69] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
[86] 96 97 98 105 106 107 108 109 110 111 112 113 114 115 116 117 118
[103] 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
[120] 136 137 138 139 140 141 142 143 144 145 146 147 148 155 156 157 158
[137] 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
[154] 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
[171] 193 194 195 196 197 198 205 206 207 208 209 210 211 212 213 214 215
[188] 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232
[205] 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 255
[222] 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
[239] 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
[256] 290 291 292 293 294 295 296 297 298 305 306 307 308 309 310 311 312
[273] 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329
[290] 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346
[307] 347 348 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369
[324] 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386
[341] 387 388 389 390 391 392 393 394 395 396 397 398 405 406 407 408 409
[358] 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426
[375] 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443
[392] 444 445 446 447 448 455 456 457 458
OK, so this looks correct. This includes most lines but there are gaps where are header is located.
We can check our data now using either head() or glimpse().
chr [1:400] "Men 20.2 (17.8-22.7) 19.7 (17.2-22.2) 22.4 (20.0-25.0) 22.8 (20.3-25.3) 22.5 (20.0-25.0) 23.6 (21.0-26.1)" ...
[1] "Men 20.2 (17.8-22.7) 19.7 (17.2-22.2) 22.4 (20.0-25.0) 22.8 (20.3-25.3) 22.5 (20.0-25.0) 23.6 (21.0-26.1)"
[2] "Afghanistan Women 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4)"
[3] "Men 25.2 (23.9-26.5) 25.0 (23.7-26.4) 25.4 (24.0-26.7) 27.0 (26.0-27.9) 26.9 (25.9-27.9) 27.0 (26.0-28.0)"
[4] "Albania Women 26.0 (24.1-27.9) 26.1 (24.1-28.1) 25.9 (23.9-27.8) 26.0 (24.8-27.2) 26.2 (24.8-27.5) 25.9 (24.6-27.2)"
[5] "Men 22.1 (20.8-23.3) 21.8 (20.5-23.1) 22.3 (21.0-23.6) 25.1 (24.5-25.7) 24.8 (24.1-25.4) 25.2 (24.6-25.9)"
[6] "Algeria Women 24.0 (22.2-25.7) 23.3 (21.4-25.1) 24.8 (22.9-26.6) 27.4 (26.7-28.0) 27.0 (26.3-27.8) 27.5 (26.7-28.2)"
So the head() function shows us the first rows or lines of the data, while the glimpse() function provides us information about the total size of the object and shows us the first line or row.
Great! So now our data looks much better but we need to add back our header and we would like this to only be a single line to make it easy to transform our data into a table or table-like object.
First let’s try splitting our header-less data into columns based on spaces:
[,1] [,2] [,3] [,4] [,5]
[1,] "Men" "20.2" "(17.8-22.7)" "19.7" "(17.2-22.2)"
[2,] "Afghanistan" "Women" "20.6" "(18.4-22.8)" "20.1"
[3,] "Men" "25.2" "(23.9-26.5)" "25.0" "(23.7-26.4)"
[4,] "Albania" "Women" "26.0" "(24.1-27.9)" "26.1"
[5,] "Men" "22.1" "(20.8-23.3)" "21.8" "(20.5-23.1)"
[6,] "Algeria" "Women" "24.0" "(22.2-25.7)" "23.3"
[7,] "Men" "33.7" "(32.7-34.7)" "32.6" "(31.7-33.5)"
[8,] "American" "Samoa" "Women" "34.3" "(33.1-35.6)"
[9,] "Men" "25.0" "(22.5-27.4)" "25.3" "(22.8-27.7)"
[10,] "Andorra" "Women" "25.2" "(22.0-28.4)" "25.4"
[,6] [,7] [,8] [,9]
[1,] "22.4" "(20.0-25.0)" "22.8" "(20.3-25.3)"
[2,] "(17.8-22.4)" "23.2" "(20.9-25.4)" "24.4"
[3,] "25.4" "(24.0-26.7)" "27.0" "(26.0-27.9)"
[4,] "(24.1-28.1)" "25.9" "(23.9-27.8)" "26.0"
[5,] "22.3" "(21.0-23.6)" "25.1" "(24.5-25.7)"
[6,] "(21.4-25.1)" "24.8" "(22.9-26.6)" "27.4"
[7,] "34.0" "(32.9-35.1)" "34.3" "(33.0-35.6)"
[8,] "34.1" "(33.0-35.2)" "34.4" "(33.0-35.8)"
[9,] "25.0" "(22.5-27.4)" "26.8" "(24.4-29.2)"
[10,] "(22.2-28.7)" "25.2" "(22.0-28.4)" "25.3"
[,10] [,11] [,12] [,13]
[1,] "22.5" "(20.0-25.0)" "23.6" "(21.0-26.1)"
[2,] "(23.3-25.4)" "23.6" "(22.5-24.8)" "26.3"
[3,] "26.9" "(25.9-27.9)" "27.0" "(26.0-28.0)"
[4,] "(24.8-27.2)" "26.2" "(24.8-27.5)" "25.9"
[5,] "24.8" "(24.1-25.4)" "25.2" "(24.6-25.9)"
[6,] "(26.7-28.0)" "27.0" "(26.3-27.8)" "27.5"
[7,] "34.6" "(33.1-35.9)" "34.2" "(32.9-35.6)"
[8,] "35.3" "(33.7-36.9)" "35.0" "(33.1-36.9)"
[9,] "26.8" "(24.3-29.2)" "26.8" "(24.3-29.3)"
[10,] "(22.1-28.6)" "25.2" "(21.9-28.5)" "25.3"
[,14] [,15] [,16] [,17] [,18]
[1,] "" "" "" "" ""
[2,] "(25.1-27.4)" "" "" "" ""
[3,] "" "" "" "" ""
[4,] "(24.6-27.2)" "" "" "" ""
[5,] "" "" "" "" ""
[6,] "(26.7-28.2)" "" "" "" ""
[7,] "" "" "" "" ""
[8,] "35.4" "(33.7-37.1)" "" "" ""
[9,] "" "" "" "" ""
[10,] "(22.1-28.6)" "" "" "" ""
This almost worked, but unfortunately country names that have spaces will be a problem. We can see that American Samoa has been divided into two columns and all subsequent columns are shifted.
So now let’s try to extract the country information, by separating the country information from the sex information when the sex is female. Sex always starts with either a capital “W” if the gender is female. We need to use a space before the “W” otherwise we will split some of the country names if the names starts with “W”. Here we will also introduce the concept of piping, which uses the %>%. This is really useful when we have multiple steps, which we will show soon.
Women <- stringr::str_subset(string = rural_urban, pattern = "Women")
# First let's subset the data to only the lines that contain "Women"
country_split <-Women %>%
stringr::str_split(pattern= " W") %>%
unlist()
head(country_split)[1] "Afghanistan"
[2] "omen 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4)"
[3] "Albania"
[4] "omen 26.0 (24.1-27.9) 26.1 (24.1-28.1) 25.9 (23.9-27.8) 26.0 (24.8-27.2) 26.2 (24.8-27.5) 25.9 (24.6-27.2)"
[5] "Algeria"
[6] "omen 24.0 (22.2-25.7) 23.3 (21.4-25.1) 24.8 (22.9-26.6) 27.4 (26.7-28.0) 27.0 (26.3-27.8) 27.5 (26.7-28.2)"
Now we can see that Country is always the odd rows and the rest of the data is the rest of the rows
We can take a look to make sure that all the country names look as expected:
country_split
1 Afghanistan
2 Albania
3 Algeria
4 American Samoa
5 Andorra
6 Angola
7 Antigua and Barbuda
8 Argentina
9 Armenia
10 Australia
11 Austria
12 Azerbaijan
13 Bahamas
14 Bahrain
15 Bangladesh
16 Barbados
17 Belarus
18 Belgium
19 Belize
20 Benin
21 Bermuda
22 Bhutan
23 Bolivia
24 Bosnia and Herzegovina
25 Botswana
26 Brazil
27 Brunei Darussalam
28 Bulgaria
29 Burkina Faso
30 Burundi
31 Cabo Verde
32 Cambodia
33 Cameroon
34 Canada
35 Central African Republic
36 Chad
37 Chile
38 China
39 China (Hong Kong SAR)
40 Colombia
41 Comoros
42 Congo
43 Cook Islands
44 Costa Rica
45 Cote d'Ivoire
46 Croatia
47 Cuba
48 Cyprus
49 Czech Republic
50 Denmark
51 Djibouti
52 Dominica
53 Dominican Republic
54 DR Congo
55 Ecuador
56 Egypt
57 El Salvador
58 Equatorial Guinea
59 Eritrea
60 Estonia
61 Ethiopia
62 Fiji
63 Finland
64 France
65 French Polynesia
66 Gabon
67 Gambia
68 Georgia
69 Germany
70 Ghana
71 Greece
72 Greenland
73 Grenada
74 Guatemala
75 Guinea
76 Guinea Bissau
77 Guyana
78 Haiti
79 Honduras
80 Hungary
81 Iceland
82 India
83 Indonesia
84 Iran
85 Iraq
86 Ireland
87 Israel
88 Italy
89 Jamaica
90 Japan
91 Jordan
92 Kazakhstan
93 Kenya
94 Kiribati
95 Kuwait
96 Kyrgyzstan
97 Lao PDR
98 Latvia
99 Lebanon
100 Lesotho
101 Liberia
102 Libya
103 Lithuania
104 Luxembourg
105 Macedonia (TFYR)
106 Madagascar
107 Malawi
108 Malaysia
109 Maldives
110 Mali
111 Malta
112 Marshall Islands
113 Mauritania
114 Mauritius
115 Mexico
116 Micronesia (Federated States of)
117 Moldova
118 Mongolia
119 Montenegro
120 Morocco
121 Mozambique
122 Myanmar
123 Namibia
124 Nauru
125 Nepal
126 Netherlands
127 New Zealand
128 Nicaragua
129 Niger
130 Nigeria
131 Niue
132 North Korea
133 Norway
134 Occupied Palestinian Territory
135 Oman
136 Pakistan
137 Palau
138 Panama
139 Papua New Guinea
140 Paraguay
141 Peru
142 Philippines
143 Poland
144 Portugal
145 Puerto Rico
146 Qatar
147 Romania
148 Russian Federation
149 Rwanda
150 Saint Kitts and Nevis
151 Saint Lucia
152 Saint Vincent and the Grenadines
153 Samoa
154 Sao Tome and Principe
155 Saudi Arabia
156 Senegal
157 Serbia
158 Seychelles
159 Sierra Leone
160 Singapore
161 Slovakia
162 Slovenia
163 Solomon Islands
164 Somalia
165 South Africa
166 South Korea
167 Spain
168 Sri Lanka
169 Sudan (former)
170 Suriname
171 Swaziland
172 Sweden
173 Switzerland
174 Syrian Arab Republic
175 Taiwan
176 Tajikistan
177 Tanzania
178 Thailand
179 Timor-Leste
180 Togo
181 Tokelau
182 Tonga
183 Trinidad and Tobago
184 Tunisia
185 Turkey
186 Turkmenistan
187 Tuvalu
188 Uganda
189 Ukraine
190 United Arab Emirates
191 United Kingdom
192 United States of America
193 Uruguay
194 Uzbekistan
195 Vanuatu
196 Venezuela
197 Viet Nam
198 Yemen
199 Zambia
200 Zimbabwe
Great! Now we have a list of the countries that can be used for both the male and female data.
Let’s add the “W” back to our Women data. We can do so by replacing the string “omen” with “Women” using str_repace().
chr [1:200] "Afghanistan Women 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4)" ...
Women<-Women_BMI %>%
mutate(country_split =stringr::str_replace(string =country_split, pattern ="omen", replacement = "Women"))
head(Women$country_split)[1] "Women 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4)"
[2] "Women 26.0 (24.1-27.9) 26.1 (24.1-28.1) 25.9 (23.9-27.8) 26.0 (24.8-27.2) 26.2 (24.8-27.5) 25.9 (24.6-27.2)"
[3] "Women 24.0 (22.2-25.7) 23.3 (21.4-25.1) 24.8 (22.9-26.6) 27.4 (26.7-28.0) 27.0 (26.3-27.8) 27.5 (26.7-28.2)"
[4] "Women 34.3 (33.1-35.6) 34.1 (33.0-35.2) 34.4 (33.0-35.8) 35.3 (33.7-36.9) 35.0 (33.1-36.9) 35.4 (33.7-37.1)"
[5] "Women 25.2 (22.0-28.4) 25.4 (22.2-28.7) 25.2 (22.0-28.4) 25.3 (22.1-28.6) 25.2 (21.9-28.5) 25.3 (22.1-28.6)"
[6] "Women 21.3 (18.0-24.6) 20.9 (17.6-24.3) 22.7 (19.3-26.0) 24.4 (21.2-27.7) 23.3 (20.0-26.7) 25.7 (22.4-29.1)"
It’s always a good idea to check that your data objects are the size you expect when wrangling. We can do so with the dim() function, which shows us the dimensions of data objects.
[1] 200 1
Great! There are 200 rows like we expected.
Let’s grab the male data:
# remember our rural_urban object contains male data for the odd rows
Men <-data.frame(rural_urban) %>%
dplyr::filter(row_number() %% 2 == 1)
head(Men$rural_urban)[1] Men 20.2 (17.8-22.7) 19.7 (17.2-22.2) 22.4 (20.0-25.0) 22.8 (20.3-25.3) 22.5 (20.0-25.0) 23.6 (21.0-26.1)
[2] Men 25.2 (23.9-26.5) 25.0 (23.7-26.4) 25.4 (24.0-26.7) 27.0 (26.0-27.9) 26.9 (25.9-27.9) 27.0 (26.0-28.0)
[3] Men 22.1 (20.8-23.3) 21.8 (20.5-23.1) 22.3 (21.0-23.6) 25.1 (24.5-25.7) 24.8 (24.1-25.4) 25.2 (24.6-25.9)
[4] Men 33.7 (32.7-34.7) 32.6 (31.7-33.5) 34.0 (32.9-35.1) 34.3 (33.0-35.6) 34.6 (33.1-35.9) 34.2 (32.9-35.6)
[5] Men 25.0 (22.5-27.4) 25.3 (22.8-27.7) 25.0 (22.5-27.4) 26.8 (24.4-29.2) 26.8 (24.3-29.2) 26.8 (24.3-29.3)
[6] Men 20.5 (17.9-23.1) 20.2 (17.6-22.9) 21.4 (18.8-24.0) 22.6 (20.0-25.1) 22.0 (19.4-24.6) 23.2 (20.6-25.9)
400 Levels: Afghanistan Women 20.6 (18.4-22.8) 20.1 (17.8-22.4) 23.2 (20.9-25.4) 24.4 (23.3-25.4) 23.6 (22.5-24.8) 26.3 (25.1-27.4) ...
[1] 200 1
How about our number of columns?
If we try splitting our data by space again, will it have the expected number of columns? What about the rows that contain na* values?
# Let's just take the Men data that contains na* values - this column is called rural_urban
unknown <-Men %>%filter(str_detect(pattern ="na\\*", string = rural_urban))
# Now we can try splitting by a space
tibble::as_tibble(str_split(unknown$rural_urban, " ", simplify = TRUE))# A tibble: 5 x 12
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Men 24.3 (21.8… na* 24.3 (21.… 26.7 (24.… na* 26.7 (24.… ""
2 Men 22.4 (21.4… 21.5 (20.2… 22.4 (21.… 24.8 (23.… na* 24.8 (23.…
3 Men 32.6 (32.0… na* 32.6 (32.… 32.9 (31.… na* 32.9 (31.… ""
4 Men 22.8 (22.2… na* 22.8 (22.… 24.4 (23.… na* 24.4 (23.… ""
5 Men 29.6 (28.1… 29.6 (28.1… na* 32.3 (31.… 32.3 (31.… na* ""
So close! Notice that the "na*" values have shifted the subsequent values within the columns because typically there is a space between the BMI and the credible intervals. Here we can see this data in our original pdf:
We need to replace our na* values with something that includes a space so that when we separate our data by space we will have two values instead of one when we have an na* . Therefore, na* na* should work.
[1] "factor"
# In order to use the functions within the stringr package, our Men data needs to be of character class
Men<-as.character(Men$rural_urban)
# This function allows us to change the factor data within Men$rural_urban to the character class
# We are changing the structure so that the data is directly just called Men and not within a vector called rural_urban
Men<-stringr::str_replace_all(string = Men, pattern = "na\\*", replacement = "na\\* na\\*")
# The \\ are necessary because the * is a special character
# The * would typically indicate any possible value, but here we actually want a "*" instead
# Thus the double backslash does that for us
# Here we are replacing all occurences of the na* values (thus str_replace_all instead of str_replace) with na* na*.
# To check that this worked...
Men[20:30] [1] "Men 20.7 (19.4-22.1) 20.3 (18.9-21.7) 21.7 (20.3-23.1) 22.7 (22.0-23.3) 22.1 (21.3-22.8) 23.4 (22.7-24.1)"
[2] "Men 24.3 (21.8-26.8) na* na* 24.3 (21.8-26.8) 26.7 (24.3-29.2) na* na* 26.7 (24.3-29.2)"
[3] "Men 20.6 (19.1-22.1) 20.3 (18.7-21.8) 23.1 (21.5-24.6) 23.5 (22.8-24.3) 23.1 (22.2-23.9) 24.2 (23.3-25.2)"
[4] "Men 23.3 (21.0-25.5) 22.9 (20.6-25.2) 23.6 (21.3-25.9) 26.1 (23.9-28.4) 25.3 (23.1-27.5) 26.5 (24.2-28.8)"
[5] "Men 24.9 (23.5-26.2) 24.8 (23.4-26.2) 25.0 (23.6-26.4) 26.8 (25.7-27.8) 26.7 (25.7-27.8) 26.8 (25.7-27.9)"
[6] "Men 20.6 (19.2-22.0) 20.3 (18.8-21.7) 21.4 (20.0-22.9) 22.6 (21.9-23.4) 21.9 (20.9-22.8) 23.2 (22.3-24.0)"
[7] "Men 23.3 (22.5-24.0) 22.2 (21.4-23.0) 23.7 (22.9-24.5) 26.2 (25.7-26.7) 25.0 (24.3-25.6) 26.4 (25.9-26.9)"
[8] "Men 24.0 (22.5-25.4) 23.7 (22.1-25.1) 24.2 (22.7-25.7) 27.1 (26.4-27.7) 26.6 (26.0-27.3) 27.2 (26.5-27.9)"
[9] "Men 25.0 (23.8-26.4) 24.9 (23.5-26.4) 25.1 (23.8-26.5) 27.0 (25.8-28.1) 27.0 (25.7-28.3) 27.0 (25.7-28.2)"
[10] "Men 20.2 (18.8-21.5) 20.0 (18.6-21.4) 21.4 (19.9-22.9) 22.2 (21.4-23.0) 21.8 (21.0-22.6) 23.1 (22.1-24.1)"
[11] "Men 20.2 (17.7-22.6) 20.1 (17.6-22.5) 21.2 (18.7-23.6) 22.1 (19.8-24.5) 22.0 (19.6-24.3) 23.2 (20.8-25.6)"
[1] "character"
# the female data is already in character form
Women<-stringr::str_replace_all(string = Women$country_split, pattern = "na\\*", replacement = "na\\* na\\*")
Women[20:30] [1] "Women 20.6 (19.1-22.0) 20.1 (18.5-21.6) 21.8 (20.3-23.3) 24.3 (23.5-25.0) 23.2 (22.5-24.0) 25.5 (24.7-26.3)"
[2] "Women 25.4 (22.2-28.6) na* na* 25.4 (22.2-28.6) 28.5 (25.3-31.6) na* na* 28.5 (25.3-31.6)"
[3] "Women 20.7 (18.4-22.9) 20.3 (18.0-22.6) 23.0 (20.8-25.3) 24.6 (23.7-25.5) 23.8 (22.7-24.9) 25.9 (24.7-27.0)"
[4] "Women 23.8 (22.3-25.3) 23.0 (21.4-24.5) 24.6 (23.1-26.1) 27.9 (26.5-29.3) 27.0 (25.6-28.5) 28.3 (26.8-29.7)"
[5] "Women 25.5 (23.5-27.4) 25.7 (23.7-27.7) 25.1 (23.1-27.0) 25.7 (24.4-27.1) 26.0 (24.6-27.5) 25.3 (23.9-26.8)"
[6] "Women 24.2 (22.2-26.2) 23.7 (21.5-25.8) 25.6 (23.5-27.7) 26.2 (25.3-27.1) 25.0 (23.8-26.2) 27.0 (26.0-28.2)"
[7] "Women 24.0 (23.2-25.0) 23.1 (22.1-24.1) 24.4 (23.5-25.4) 26.9 (26.2-27.5) 26.4 (25.7-27.2) 26.9 (26.3-27.6)"
[8] "Women 23.6 (21.4-25.6) 22.9 (20.7-25.0) 24.0 (21.8-26.1) 27.4 (26.6-28.1) 27.0 (26.1-27.8) 27.5 (26.7-28.2)"
[9] "Women 25.2 (23.4-27.1) 25.5 (23.4-27.6) 25.0 (23.1-26.9) 25.6 (24.1-27.1) 26.0 (24.2-27.8) 25.5 (23.9-27.1)"
[10] "Women 19.9 (18.7-21.2) 19.6 (18.4-20.9) 22.0 (20.7-23.3) 22.2 (21.4-23.1) 21.1 (20.2-22.0) 24.7 (23.7-25.7)"
[11] "Women 19.6 (17.5-21.7) 19.5 (17.4-21.6) 21.7 (19.5-23.8) 21.1 (20.3-22.0) 20.7 (19.8-21.6) 24.0 (23.0-24.9)"
Great, now we can split our data by spaces.
Men <-tibble::as_tibble(str_split(Men, " ", simplify = TRUE))
# Note here we need to use the column of Women that has the data
head(Men)# A tibble: 6 x 13
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Men 20.2 (17.8… 19.7 (17.2… 22.4 (20.… 22.8 (20.… 22.5 (20.… 23.6
2 Men 25.2 (23.9… 25.0 (23.7… 25.4 (24.… 27.0 (26.… 26.9 (25.… 27.0
3 Men 22.1 (20.8… 21.8 (20.5… 22.3 (21.… 25.1 (24.… 24.8 (24.… 25.2
4 Men 33.7 (32.7… 32.6 (31.7… 34.0 (32.… 34.3 (33.… 34.6 (33.… 34.2
5 Men 25.0 (22.5… 25.3 (22.8… 25.0 (22.… 26.8 (24.… 26.8 (24.… 26.8
6 Men 20.5 (17.9… 20.2 (17.6… 21.4 (18.… 22.6 (20.… 22.0 (19.… 23.2
# … with 1 more variable: V13 <chr>
Women<-as_tibble(str_split(Women, " ", simplify = TRUE))
# Note here we need to use the column of Women that has the data
head(Women)# A tibble: 6 x 13
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Women 20.6 (18.4… 20.1 (17.8… 23.2 (20.… 24.4 (23.… 23.6 (22.… 26.3
2 Women 26.0 (24.1… 26.1 (24.1… 25.9 (23.… 26.0 (24.… 26.2 (24.… 25.9
3 Women 24.0 (22.2… 23.3 (21.4… 24.8 (22.… 27.4 (26.… 27.0 (26.… 27.5
4 Women 34.3 (33.1… 34.1 (33.0… 34.4 (33.… 35.3 (33.… 35.0 (33.… 35.4
5 Women 25.2 (22.0… 25.4 (22.2… 25.2 (22.… 25.3 (22.… 25.2 (21.… 25.3
6 Women 21.3 (18.0… 20.9 (17.6… 22.7 (19.… 24.4 (21.… 23.3 (20.… 25.7
# … with 1 more variable: V13 <chr>
# A tibble: 11 x 13
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Women 20.6 (19.1… 20.1 (18.… 21.8 (20.… 24.3 (23.… 23.2 (22.… 25.5
2 Women 25.4 (22.2… na* na* 25.4 (22.… 28.5 (25.… na* na* 28.5
3 Women 20.7 (18.4… 20.3 (18.… 23.0 (20.… 24.6 (23.… 23.8 (22.… 25.9
4 Women 23.8 (22.3… 23.0 (21.… 24.6 (23.… 27.9 (26.… 27.0 (25.… 28.3
5 Women 25.5 (23.5… 25.7 (23.… 25.1 (23.… 25.7 (24.… 26.0 (24.… 25.3
6 Women 24.2 (22.2… 23.7 (21.… 25.6 (23.… 26.2 (25.… 25.0 (23.… 27.0
7 Women 24.0 (23.2… 23.1 (22.… 24.4 (23.… 26.9 (26.… 26.4 (25.… 26.9
8 Women 23.6 (21.4… 22.9 (20.… 24.0 (21.… 27.4 (26.… 27.0 (26.… 27.5
9 Women 25.2 (23.4… 25.5 (23.… 25.0 (23.… 25.6 (24.… 26.0 (24.… 25.5
10 Women 19.9 (18.7… 19.6 (18.… 22.0 (20.… 22.2 (21.… 21.1 (20.… 24.7
11 Women 19.6 (17.5… 19.5 (17.… 21.7 (19.… 21.1 (20.… 20.7 (19.… 24.0
# … with 1 more variable: V13 <chr>
We can see from our pdf and our object called header what the header was like. Let’s also make a new single line header, but let’s wait to add Country:
#header
new_header <-c("Sex","National_BMI_1985", "National_BMI_1985_CI", "Rural_BMI_1985", "Rural_BMI_1985_CI", "Urban_BMI_1985", "Urban_BMI_1985_CI", "National_BMI_2017", "National_BMI_2017_CI","Rural_BMI_2017", "Rural_BMI_2017_CI", "Urban_BMI_2017", "Urban_BMI_2017_CI")Let’s change the names of our columns of our tibble to this new header for our Men and Women data
Now we will add our country data to both our Men and Women tibbles:
Women <- dplyr::bind_cols(country, Women)
# This will add the country as a new column to the Women data object on the left
Men <- dplyr::bind_cols(country, Men)
# This will add the country as a new column to the Men data object on the left
head(Women) country_split Sex National_BMI_1985 National_BMI_1985_CI
1 Afghanistan Women 20.6 (18.4-22.8)
2 Albania Women 26.0 (24.1-27.9)
3 Algeria Women 24.0 (22.2-25.7)
4 American Samoa Women 34.3 (33.1-35.6)
5 Andorra Women 25.2 (22.0-28.4)
6 Angola Women 21.3 (18.0-24.6)
Rural_BMI_1985 Rural_BMI_1985_CI Urban_BMI_1985 Urban_BMI_1985_CI
1 20.1 (17.8-22.4) 23.2 (20.9-25.4)
2 26.1 (24.1-28.1) 25.9 (23.9-27.8)
3 23.3 (21.4-25.1) 24.8 (22.9-26.6)
4 34.1 (33.0-35.2) 34.4 (33.0-35.8)
5 25.4 (22.2-28.7) 25.2 (22.0-28.4)
6 20.9 (17.6-24.3) 22.7 (19.3-26.0)
National_BMI_2017 National_BMI_2017_CI Rural_BMI_2017 Rural_BMI_2017_CI
1 24.4 (23.3-25.4) 23.6 (22.5-24.8)
2 26.0 (24.8-27.2) 26.2 (24.8-27.5)
3 27.4 (26.7-28.0) 27.0 (26.3-27.8)
4 35.3 (33.7-36.9) 35.0 (33.1-36.9)
5 25.3 (22.1-28.6) 25.2 (21.9-28.5)
6 24.4 (21.2-27.7) 23.3 (20.0-26.7)
Urban_BMI_2017 Urban_BMI_2017_CI
1 26.3 (25.1-27.4)
2 25.9 (24.6-27.2)
3 27.5 (26.7-28.2)
4 35.4 (33.7-37.1)
5 25.3 (22.1-28.6)
6 25.7 (22.4-29.1)
Here we will change the variable name of the country data to country, currently it is called country_split. Here we will also introduce the concept of the assignment pipe. In this case our pipe operator looks like this %<>% . Using this fancier pipe requires another package, called magrittr. The other simpler pipe options from this package are loaded with tidyverse (if you used library(tidyverse) which loads most tidyverse packages), but not this fancier version.
The > portion of the pipe still behaves like a normal pipe, while the < portion of the pipe makes an assignment to whatever the <is pointing to, just like when we use the typical assignment operator <-.
# library(magrittr)
# We can't use the `%<>%` unless we load the magrittr package
# We have already done this but we include this for illustrative purposes.
# Here we will just use the traditional assignment strategy
Women<-rename(Women, Country = country_split)
# We have renamed the country_split variable to be called Country
# Here we reassign Men using the pipe strategy
Men %<>%rename(Country = country_split)
# We have renamed the country_split variable to to be called country
# We have also reassigned Men to the data object which has the country variable renamedNow we can combine our Men and Women data using the full_join() function of dplyr. This will put all the female data first (x) and all the male data second (y).
BMI<-dplyr::full_join(x = Men, y = Women)
# Let's check the size of our BMI data... it should have 400 rows (obs).
str(BMI)'data.frame': 400 obs. of 14 variables:
$ Country : Factor w/ 400 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : chr "Men" "Men" "Men" "Men" ...
$ National_BMI_1985 : chr "20.2" "25.2" "22.1" "33.7" ...
$ National_BMI_1985_CI: chr "(17.8-22.7)" "(23.9-26.5)" "(20.8-23.3)" "(32.7-34.7)" ...
$ Rural_BMI_1985 : chr "19.7" "25.0" "21.8" "32.6" ...
$ Rural_BMI_1985_CI : chr "(17.2-22.2)" "(23.7-26.4)" "(20.5-23.1)" "(31.7-33.5)" ...
$ Urban_BMI_1985 : chr "22.4" "25.4" "22.3" "34.0" ...
$ Urban_BMI_1985_CI : chr "(20.0-25.0)" "(24.0-26.7)" "(21.0-23.6)" "(32.9-35.1)" ...
$ National_BMI_2017 : chr "22.8" "27.0" "25.1" "34.3" ...
$ National_BMI_2017_CI: chr "(20.3-25.3)" "(26.0-27.9)" "(24.5-25.7)" "(33.0-35.6)" ...
$ Rural_BMI_2017 : chr "22.5" "26.9" "24.8" "34.6" ...
$ Rural_BMI_2017_CI : chr "(20.0-25.0)" "(25.9-27.9)" "(24.1-25.4)" "(33.1-35.9)" ...
$ Urban_BMI_2017 : chr "23.6" "27.0" "25.2" "34.2" ...
$ Urban_BMI_2017_CI : chr "(21.0-26.1)" "(26.0-28.0)" "(24.6-25.9)" "(32.9-35.6)" ...
Now Lets sort the data by country:
Country Sex National_BMI_1985 National_BMI_1985_CI Rural_BMI_1985
1 Afghanistan Men 20.2 (17.8-22.7) 19.7
2 Afghanistan Women 20.6 (18.4-22.8) 20.1
3 Albania Men 25.2 (23.9-26.5) 25.0
4 Albania Women 26.0 (24.1-27.9) 26.1
5 Algeria Men 22.1 (20.8-23.3) 21.8
6 Algeria Women 24.0 (22.2-25.7) 23.3
Rural_BMI_1985_CI Urban_BMI_1985 Urban_BMI_1985_CI National_BMI_2017
1 (17.2-22.2) 22.4 (20.0-25.0) 22.8
2 (17.8-22.4) 23.2 (20.9-25.4) 24.4
3 (23.7-26.4) 25.4 (24.0-26.7) 27.0
4 (24.1-28.1) 25.9 (23.9-27.8) 26.0
5 (20.5-23.1) 22.3 (21.0-23.6) 25.1
6 (21.4-25.1) 24.8 (22.9-26.6) 27.4
National_BMI_2017_CI Rural_BMI_2017 Rural_BMI_2017_CI Urban_BMI_2017
1 (20.3-25.3) 22.5 (20.0-25.0) 23.6
2 (23.3-25.4) 23.6 (22.5-24.8) 26.3
3 (26.0-27.9) 26.9 (25.9-27.9) 27.0
4 (24.8-27.2) 26.2 (24.8-27.5) 25.9
5 (24.5-25.7) 24.8 (24.1-25.4) 25.2
6 (26.7-28.0) 27.0 (26.3-27.8) 27.5
Urban_BMI_2017_CI
1 (21.0-26.1)
2 (25.1-27.4)
3 (26.0-28.0)
4 (24.6-27.2)
5 (24.6-25.9)
6 (26.7-28.2)
Our data is looking great! Now we might want to make sure that our observations for each variable look the way we want. In other words if we want to make plots about National BMI in 1985 then we would need our values to be numeric. Looking at our BMI data using str(), we can see the type of data for each of our variables listed just after variable name and the “:” colon. Looks like none of our BMI data is actually numeric, but of the class character. Let’s change that now.
'data.frame': 400 obs. of 14 variables:
$ Country : Factor w/ 400 levels "Afghanistan",..: 1 1 2 2 3 3 4 4 5 5 ...
$ Sex : chr "Men" "Women" "Men" "Women" ...
$ National_BMI_1985 : chr "20.2" "20.6" "25.2" "26.0" ...
$ National_BMI_1985_CI: chr "(17.8-22.7)" "(18.4-22.8)" "(23.9-26.5)" "(24.1-27.9)" ...
$ Rural_BMI_1985 : chr "19.7" "20.1" "25.0" "26.1" ...
$ Rural_BMI_1985_CI : chr "(17.2-22.2)" "(17.8-22.4)" "(23.7-26.4)" "(24.1-28.1)" ...
$ Urban_BMI_1985 : chr "22.4" "23.2" "25.4" "25.9" ...
$ Urban_BMI_1985_CI : chr "(20.0-25.0)" "(20.9-25.4)" "(24.0-26.7)" "(23.9-27.8)" ...
$ National_BMI_2017 : chr "22.8" "24.4" "27.0" "26.0" ...
$ National_BMI_2017_CI: chr "(20.3-25.3)" "(23.3-25.4)" "(26.0-27.9)" "(24.8-27.2)" ...
$ Rural_BMI_2017 : chr "22.5" "23.6" "26.9" "26.2" ...
$ Rural_BMI_2017_CI : chr "(20.0-25.0)" "(22.5-24.8)" "(25.9-27.9)" "(24.8-27.5)" ...
$ Urban_BMI_2017 : chr "23.6" "26.3" "27.0" "25.9" ...
$ Urban_BMI_2017_CI : chr "(21.0-26.1)" "(25.1-27.4)" "(26.0-28.0)" "(24.6-27.2)" ...
We could change these values to be numeric with 6 lines of code like this:
BMI$National_BMI_1985%<>% as.numeric
BMI$Rural_BMI_1985%<>% as.numeric
BMI$Urban_BMI_1985%<>% as.numeric
BMI$National_BMI_2017%<>% as.numeric
BMI$Rural_BMI_2017%<>% as.numeric
BMI$Urban_BMI_2017%<>% as.numeric
# And if we did this we would see these variables show as "num" which stands for numeric
str(BMI)'data.frame': 400 obs. of 14 variables:
$ Country : Factor w/ 400 levels "Afghanistan",..: 1 1 2 2 3 3 4 4 5 5 ...
$ Sex : chr "Men" "Women" "Men" "Women" ...
$ National_BMI_1985 : num 20.2 20.6 25.2 26 22.1 24 33.7 34.3 25 25.2 ...
$ National_BMI_1985_CI: chr "(17.8-22.7)" "(18.4-22.8)" "(23.9-26.5)" "(24.1-27.9)" ...
$ Rural_BMI_1985 : num 19.7 20.1 25 26.1 21.8 23.3 32.6 34.1 25.3 25.4 ...
$ Rural_BMI_1985_CI : chr "(17.2-22.2)" "(17.8-22.4)" "(23.7-26.4)" "(24.1-28.1)" ...
$ Urban_BMI_1985 : num 22.4 23.2 25.4 25.9 22.3 24.8 34 34.4 25 25.2 ...
$ Urban_BMI_1985_CI : chr "(20.0-25.0)" "(20.9-25.4)" "(24.0-26.7)" "(23.9-27.8)" ...
$ National_BMI_2017 : num 22.8 24.4 27 26 25.1 27.4 34.3 35.3 26.8 25.3 ...
$ National_BMI_2017_CI: chr "(20.3-25.3)" "(23.3-25.4)" "(26.0-27.9)" "(24.8-27.2)" ...
$ Rural_BMI_2017 : num 22.5 23.6 26.9 26.2 24.8 27 34.6 35 26.8 25.2 ...
$ Rural_BMI_2017_CI : chr "(20.0-25.0)" "(22.5-24.8)" "(25.9-27.9)" "(24.8-27.5)" ...
$ Urban_BMI_2017 : num 23.6 26.3 27 25.9 25.2 27.5 34.2 35.4 26.8 25.3 ...
$ Urban_BMI_2017_CI : chr "(21.0-26.1)" "(25.1-27.4)" "(26.0-28.0)" "(24.6-27.2)" ...
Or we can use a more automated way with the map() function of the purrr package:
BMI_numeric <- as_tibble(purrr::map((BMI %>% select(-matches("CI|Sex|Country"))), as.numeric))
# The map function of the purrr package allows us to apply the as.numeric() function to all the selected columns of BMI
# We have selected those that dont contain CI, Sex, or Country in the column name
# If you are familiar with apply(), this function is very similar
BMI_numeric %<>% mutate(Country =BMI$Country, Sex =BMI$Sex)
head(BMI_numeric)# A tibble: 6 x 8
National_BMI_19… Rural_BMI_1985 Urban_BMI_1985 National_BMI_20…
<dbl> <dbl> <dbl> <dbl>
1 20.2 19.7 22.4 22.8
2 20.6 20.1 23.2 24.4
3 25.2 25 25.4 27
4 26 26.1 25.9 26
5 22.1 21.8 22.3 25.1
6 24 23.3 24.8 27.4
# … with 4 more variables: Rural_BMI_2017 <dbl>, Urban_BMI_2017 <dbl>,
# Country <fct>, Sex <chr>
It is generally useful to get the data in what is called long format for other analyses, and particularly for plotting.
For a more detailed description about this, please see this case study
To do this we will use the gather() function of the tidyr package:
BMI_long <-tidyr::gather(data =BMI_numeric, key = class_BMI, value = BMI, National_BMI_1985:Urban_BMI_2017, factor_key = FALSE)
# The data value indicates what data you start with
# The key indicates the new name of the column that will now contian information about the values of the columns we will put together
# The value is the name of the column for the values for all the columns that are being put together
# The ":" can be used to indentify the start and end of the column names that you are putting together
# The factor_key paramater determines if the newly created key column should be evaluated as a factor or not
head(BMI_long)# A tibble: 6 x 4
Country Sex class_BMI BMI
<fct> <chr> <chr> <dbl>
1 Afghanistan Men National_BMI_1985 20.2
2 Afghanistan Women National_BMI_1985 20.6
3 Albania Men National_BMI_1985 25.2
4 Albania Women National_BMI_1985 26
5 Algeria Men National_BMI_1985 22.1
6 Algeria Women National_BMI_1985 24
It would be useful to parse the 1985 and the 2017 data. We can do so by separating the parts of the class_BMI column using the separate() function of the tidyr package.
BMI_long%<>% tidyr::separate(class_BMI, c("Region", NA, "Year"))
# We will separate the class_BMI column of the BMI_long tibble based on the underscores and create two new columns.
# The first column will be called Region and will contain the first part of the class_BMI data before the first underscore
# The second column will be called year and will contain the third part of the data based on the divisions of the underscore
# The middle part of the column will not be used to create any new columns
head(BMI_long)# A tibble: 6 x 5
Country Sex Region Year BMI
<fct> <chr> <chr> <chr> <dbl>
1 Afghanistan Men National 1985 20.2
2 Afghanistan Women National 1985 20.6
3 Albania Men National 1985 25.2
4 Albania Women National 1985 26
5 Algeria Men National 1985 22.1
6 Algeria Women National 1985 24
# Let's see how the dimensions of the BMI data have changed now that it is in long format:
dim(BMI_numeric)[1] 400 8
[1] 2400 5
Great! our data is very usable now in this format!
Now that our data is clean and in a format that we can work with, we can start to take a look at the data and how different groups might compare.
Statistically speaking there are tests that can allow us to evaluate if the means of the groups are different. First let’s see how our data looks in general.
National_BMI_1985 Rural_BMI_1985 Urban_BMI_1985 National_BMI_2017
Min. :18.20 Min. :17.7 Min. :19.30 Min. :20.20
1st Qu.:21.50 1st Qu.:20.9 1st Qu.:22.43 1st Qu.:24.07
Median :24.00 Median :23.4 Median :24.30 Median :26.15
Mean :23.68 Mean :23.3 Mean :24.24 Mean :26.00
3rd Qu.:25.23 3rd Qu.:25.1 3rd Qu.:25.40 3rd Qu.:27.43
Max. :34.30 Max. :34.1 Max. :34.40 Max. :35.30
NA's :6 NA's :2
Rural_BMI_2017 Urban_BMI_2017 Country Sex
Min. :19.80 Min. :21.50 Afghanistan : 2 Length:400
1st Qu.:23.30 1st Qu.:24.80 Albania : 2 Class :character
Median :26.00 Median :26.30 Algeria : 2 Mode :character
Mean :25.61 Mean :26.37 American Samoa: 2
3rd Qu.:27.30 3rd Qu.:27.60 Andorra : 2
Max. :35.00 Max. :35.40 Angola : 2
NA's :8 NA's :2 (Other) :388
Lets look at the Data separately by gender:
National_BMI_1985 Rural_BMI_1985 Urban_BMI_1985 National_BMI_2017
Min. :18.20 Min. :17.70 Min. :19.70 Min. :21.00
1st Qu.:22.00 1st Qu.:21.20 1st Qu.:22.85 1st Qu.:24.40
Median :24.25 Median :23.90 Median :24.60 Median :26.10
Mean :24.00 Mean :23.59 Mean :24.63 Mean :26.38
3rd Qu.:25.50 3rd Qu.:25.40 3rd Qu.:25.60 3rd Qu.:28.00
Max. :34.30 Max. :34.10 Max. :34.40 Max. :35.30
NA's :3 NA's :1
Rural_BMI_2017 Urban_BMI_2017 Country Sex
Min. :20.40 Min. :21.60 Afghanistan : 1 Length:200
1st Qu.:23.70 1st Qu.:25.30 Albania : 1 Class :character
Median :26.20 Median :26.40 Algeria : 1 Mode :character
Mean :25.98 Mean :26.84 American Samoa: 1
3rd Qu.:27.60 3rd Qu.:28.25 Andorra : 1
Max. :35.00 Max. :35.40 Angola : 1
NA's :4 NA's :1 (Other) :194
National_BMI_1985 Rural_BMI_1985 Urban_BMI_1985 National_BMI_2017
Min. :18.50 Min. :18.40 Min. :19.30 Min. :20.20
1st Qu.:21.10 1st Qu.:20.80 1st Qu.:22.10 1st Qu.:23.30
Median :23.60 Median :23.20 Median :24.10 Median :26.20
Mean :23.36 Mean :23.01 Mean :23.84 Mean :25.61
3rd Qu.:25.00 3rd Qu.:24.90 3rd Qu.:25.10 3rd Qu.:27.10
Max. :33.70 Max. :32.60 Max. :34.00 Max. :34.30
NA's :3 NA's :1
Rural_BMI_2017 Urban_BMI_2017 Country Sex
Min. :19.80 Min. :21.5 Afghanistan : 1 Length:200
1st Qu.:22.60 1st Qu.:23.8 Albania : 1 Class :character
Median :25.75 Median :26.3 Algeria : 1 Mode :character
Mean :25.24 Mean :25.9 American Samoa: 1
3rd Qu.:27.00 3rd Qu.:27.2 Andorra : 1
Max. :34.60 Max. :34.2 Angola : 1
NA's :4 NA's :1 (Other) :194
It looks like mean BMIs have increased in all regions for both men and women. It is unclear though if this change is statistically significant.
In order to apply a statistical test to compare the means, one of the first things that is useful to do is to plot the frequency of the different possible values. This is called a distribution. To do this we can use the hist() function to create a histogram.
If the data were what we call normally distributed then the distribution should be equally centered around the mean and would look something like this:
norm_BMI_ex_data <- tibble(norm_data =rnorm(n = 200, mean = 24))
# The rnorm() function allows us to make a vector of normaly distributed data centered around the mean of 24
head(norm_BMI_ex_data)# A tibble: 6 x 1
norm_data
<dbl>
1 23.0
2 24.5
3 25.1
4 25.9
5 23.9
6 24.1
Alternatively we can plot this using the geom_hist() function of the ggplot2 package. The ggplot2 package creates plots by using layers. Notice in the following code how their is a plus sign between the ggplot() function and the geom_histogram() function. With ggplot2 we select what data we would like to plot using the first function (ggplot()) and then we add on additional layers of complexity. In this case the geom_histrogram() function initiates the plotting of a histogram with the data that was identified in the ggplot() function. We will see later how we can add many layers to plots with ggplot2. For additional information on using ggplot2, see this case study.
# We are using the column called norm_data within the norm_BMI_ex_data data object
# With ggplot2 we must first specify the data we are using with the ggplot() function
# The geom_histogram() function specifies what type of plot to createLet’s see how our data looks:
OK, so the data looks like it is what we call right skewed because the tail of the distribution is longer on the right side, but it looks fairly normally distributed.
If we parse our data, what does it look like? The easiest way to do this is to use some other functions of the ggplot2 package, particularly the facet_wrap() function, which will allow us to look at differences in the distribution of the BMI data by year, gender, and region. We can sequentially divide our plots by deeper levels using multiple variables and the + plus sign within facet_wrap().
### Quantile Quantile Plots
Let’s use a method called quantile quantile plotting to determine if the data is indeed normally distributed. These plots are called QQ plots. This method allows us to test the fit of known theoretical distributions (like the normal distribution) with our observed distribution. To do this we will plot the quantiles of our data on the y-axis and the quantiles of the theoretical normal distribution on the x-axis. If the quantiles line up then we can say that our data is fairly normal. What exactly is a quantile? This is a division of the data distribution into roughly equal portions.
Here is an example of a QQ plot for the normally distributed data that we just created using the stat_qq() function of the ggplot2 package:
norm_BMI_ex_data %>%
ggplot(aes(sample =norm_data)) +
stat_qq()+ # to add the line we need to also include stat_qq_line()
stat_qq_line()Here we can see that the quantiles are fairly similar between the observed and theoretical data. We see that the points mostly fall on the line, however there are some points that are a bit further from the line as we get to the extreme quantiles. Notice that the sample quantiles (which will be fairly similar to our real BMI data quantiles) on the y-axis has the same range as the values that we created. So values that are bellow 22 for example are represented as the points bellow 22 on the y-axis. As expected we see that about half the points are bellow our mean of 24.
If we were to use different data that had a range of different values our y-axis would shift according to the range of values. For example the Orange data within the installation of R includes data about the circumference of orange trees in millimeters. Here we can see that the quantiles are quite different but reflect the range of orange tree circumferences.
[1] 30 214
Orange%>%
ggplot(aes(sample =circumference)) +
stat_qq()+ # to add the line we need to also include stat_qq_line()
stat_qq_line()Let’s take a look at our BMI data:
BMI_long %>%
ggplot(aes(sample =BMI)) +
stat_qq()+ # to add the line we need to also include stat_qq_line()
stat_qq_line()For the sake of simplicity, we are going to focus on the data from the women. This is because women were identified in the paper to have larger increases in BMI. If we want to perform tests on these groups specifically, then we need to know how this data looks.
BMI_long %>%
filter(Sex == "Women") %>%
ggplot(aes(sample =BMI)) +
stat_qq()+ # to add the line we need to also include stat_qq_line()
stat_qq_line()What about by region and year? If we compare these data then we need to look at the distribution of each of these subsets.
BMI_long %>%
filter(Sex == "Women", Year == "2017", Region == "Rural") %>%
ggplot(aes(sample =BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() + # to add a title we can use ggtitle()
ggtitle("QQ Plot for Women 2017 Rural Data")BMI_long %>%
filter(Sex == "Women", Year == "2017", Region == "Urban") %>%
ggplot(aes(sample =BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Women 2017 Urban Data")BMI_long %>%
filter(Sex == "Women", Year == "1985", Region == "Rural") %>%
ggplot(aes(sample =BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Women 1985 Rural Data")BMI_long %>%
filter(Sex == "Women", Year == "1985", Region == "Urban") %>%
ggplot(aes(sample =BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Women 1985 Urban Data")We can see that at the extremes of our quantiles, for most of our data, our tails are not very similar to the theoretical distribution. The Rural data looks more normal than the urban data, but if we were to use statistical tests that rely on normality, this would require that both groups are normally distributed.
Finally, some statisticians also use the Shapiro-Wilk test for normality when the QQ plot is a bit unclear. Many statisticians however would conclude from the QQ plots that normality has been violated.
For illustrative purposes we will show how we can use the dplyr summarize() and group_by() functions to perform the Shapiro-Wilk test for all the subsets we are interested in.
Shapiro-Wilk normality test
data: .
W = 0.99353, p-value = 0.5325
# example of the test output for normal data
BMI_long %>%
filter(Sex == "Women") %>%
group_by(Year, Region) %>%
summarize(shapiro_test = shapiro.test(BMI)$p.value)# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.0000478
2 1985 Rural 0.000679
3 1985 Urban 0.000000332
4 2017 National 0.00363
5 2017 Rural 0.0108
6 2017 Urban 0.0000130
We see that all the data does not appear to be normally distributed.
We are interested in comparing the means of female rural and urban BMI measurements for both years. There are two possible classes of statistical tests that we could run to compare the means of these two groups:
Often when comparing two groups we might perform a two sample t test to determine if the means of each group is different. The two sample t test however, relies on several assumptions:
If these assumptions are violated, this doesn’t necessarily mean we can’t perform a t test. It just means we may have to transform the data to make it normally distributed and we may need to perform the t test in a special way to account for the difference in the variance in the two groups. Alternatively, we can use a nonparametric test like the Wilcoxon–Mann–Whitney (WMW) test. We will explore both of these options.
The t test is also fairly robust to non-normality if the data is relatively large, and we have an n of 200, which should be sufficient but let’s investigate the nonparametric tests further.
Often we would check if the variance of the rural and urban data is equal using the var.test() function. However this is an F test and assumes that the data is normally distributed. Instead we will use the mood.test() function which performs the Mood’s two-sample test for a difference in scale parameters and does not assume that the data is normally distributed.
mood.test(pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Urban"), BMI)
Z = 2.9189, p-value = 0.003513
alternative hypothesis: two.sided
# not equal - p value <.05 reject the null: no difference in the variance of the distributions
mood.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and "Urban"), BMI)
Z = 3.1305, p-value = 0.001745
alternative hypothesis: two.sided
# not equal - p value <.05, reject the null: no difference in the variance of the distributions
mood.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Rural"), BMI)
Z = -0.24228, p-value = 0.8086
alternative hypothesis: two.sided
# equal - p value >.05, accept the null: no difference in the variance of the distributions
mood.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI) and "Urban"), BMI)
Z = 1.5317, p-value = 0.1256
alternative hypothesis: two.sided
Our p value is less than .05 for both tests, thus we reject our null hypothesis that there is no difference in the variance. Therefore, we conclude that the variance is not equal and that our data also violates this assumption.
We will perform a special t.test where we account for the fact that our variance is not equal.
Another important consideration is that the data is what we call paired. Meaning the measurements from the rural and urban areas are not independent. That is because we have a rural and urban measurement mean for nearly every country. Thus these values may be more similar to one another if they come from the same country.
t.test(pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Urban"), BMI)
t = -10.356, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0573625 -0.7190478
sample estimates:
mean of the differences
-0.8882051
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and "Urban"), BMI)
t = -14.095, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1870263 -0.8956268
sample estimates:
mean of the differences
-1.041327
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Rural"), BMI)
t = -22.119, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.591762 -2.167422
sample estimates:
mean of the differences
-2.379592
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI) and "Urban"), BMI)
t = -24.378, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.383938 -2.027118
sample estimates:
mean of the differences
-2.205528
Now we will try transform our data to make it more normally distributed. One way to do this is to take the logarithm of the data values. Then we will see how this influences the results.
# create a new column with the log version of the BMI variable
BMI_long_log <- mutate(BMI_long, log_BMI=log(pull(BMI_long,BMI)))
BMI_long_log %>%
filter(Sex == "Women", Year == "2017", Region == "Rural") %>%
ggplot(aes(sample =log_BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Log Women 2017 Rural Data")BMI_long_log %>%
filter(Sex == "Women", Year == "2017", Region == "Urban") %>%
ggplot(aes(sample =log_BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Log Women 2017 Urban Data")BMI_long_log %>%
filter(Sex == "Women", Year == "1985", Region == "Rural") %>%
ggplot(aes(sample =log_BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Log Women 1985 Rural Data")BMI_long_log %>%
filter(Sex == "Women", Year == "1985", Region == "Urban") %>%
ggplot(aes(sample =log_BMI)) +
stat_qq() + # to add the line we need to also include stat_qq_line()
stat_qq_line() +
ggtitle("QQ Plot for Log Women 1985 Urban Data")BMI_long_log %>%
filter(Sex == "Women") %>%
group_by(Year, Region) %>%
summarize(shapiro_test = shapiro.test(log_BMI)$p.value)# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.00315
2 1985 Rural 0.0105
3 1985 Urban 0.000334
4 2017 National 0.293
5 2017 Rural 0.0784
6 2017 Urban 0.00416
# recall what it was before
BMI_long %>%
filter(Sex == "Women") %>%
group_by(Year, Region) %>%
summarize(shapiro_test = shapiro.test(BMI)$p.value)# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.0000478
2 1985 Rural 0.000679
3 1985 Urban 0.000000332
4 2017 National 0.00363
5 2017 Rural 0.0108
6 2017 Urban 0.0000130
Let’s see the results of the t test with the transformed data:
t.test(pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI) and "Urban"), log_BMI)
t = -10.058, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.04242774 -0.02851589
sample estimates:
mean of the differences
-0.03547182
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI) and "Urban"), log_BMI)
t = -13.962, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05214677 -0.03923811
sample estimates:
mean of the differences
-0.04569244
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI) and "Rural"), log_BMI)
t = -22.369, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.10617051 -0.08896626
sample estimates:
mean of the differences
-0.09756839
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Urban"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI) and "Urban"), log_BMI)
t = -23.977, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09377834 -0.07952498
sample estimates:
mean of the differences
-0.08665166
The data appears to be more similar to the normal distribution, but most subsets still fail the Shapiro-Wilk normality test. Again, our sample size of 200 is quite large and the t test is generally quite robust to violations of normality with large n, thus the modified t test to account for unequal variance might be a good option using the log normalized data, as it is at least more normally distributed. However, let’s take a look at non parametric tests, which are also a great option when the assumptions of the t test are violated.
There are two options to consider when the assumptions of the t test are violated. The Wilcoxon signed rank test and the Two-sample Kolmogorov-Smirnov test both do not assume normality. Thus these tests should be considered when the data of either groups does not appear to be normally distributed and particularly when the number of samples is low.
Importantly the Kolmogorov-Smirnov test does not assume normality or equal variance, while the Wilcoxon signed rank test does assume equal variance. Here is how you would perform these tests. However in our case, because the variance is not equal between our groups of interest, the Kolmogorov-Smirnov test would be more appropriate.
#wilcox.test(pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI),
# pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI),
# paired = TRUE)
#wilcox.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
# pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI),
# paired = TRUE))
ks.test(pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI),
paired = TRUE)
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Urban"), BMI)
D = 0.20006, p-value = 0.0007385
alternative hypothesis: two-sided
ks.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI),
paired = TRUE)
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and "Urban"), BMI)
D = 0.19914, p-value = 0.0007779
alternative hypothesis: two-sided
##What about the difference in female BMI from 1985 to 2017 for both regions?
ks.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI),
paired = TRUE)
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Rural"), BMI)
D = 0.38359, p-value = 5.553e-13
alternative hypothesis: two-sided
ks.test(pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI),
paired = TRUE)
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI) and "Urban"), BMI)
D = 0.43216, p-value < 2.2e-16
alternative hypothesis: two-sided
Again we will utilize ggplot2 to create plots.If you need additional information please see here. The top two lines of the code for the following plots, filter the data to only specific values of interest. Then we layer what is called a jitter on top of a box plot. A jitter is essentially a dot plot but with some variation on the location of the points so that they do not line up vertically which can make the individual points difficult to see.
Let’s look at the national mean BMI estimates for each of the years:
BMI_long %>%
filter(Sex == "Women", Year == "1985", Region != "National") %>%
ggplot(aes(x = Region, y = BMI)) +
geom_boxplot() +
geom_jitter(width = .3) # the width determines how wide the points are plottedBMI_long %>%
filter(Sex == "Women", Year == "2017", Region != "National") %>%
ggplot(aes(x = Region, y = BMI)) +
geom_boxplot() +
geom_jitter(width = .3)Now Let’s look at the change in rural and urban mean BMI estimates:
BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
# filtering this way allows us to keep data from both years
ggplot(aes(x = Year, y = BMI)) +
geom_boxplot() +
geom_jitter(width = .3)BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Urban") %>%
ggplot(aes(x = Year, y = BMI)) +
geom_boxplot() +
geom_jitter(width = .3)How do the different countries compare? Or in other words what do the individual dots represent in our box plots? We will take a look using geom_label():
BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
ggplot(aes(x = Year, y = BMI)) +
geom_boxplot() +
geom_jitter(width = .3)+
geom_label(aes(label = Country))If we include all country names this is a bit too much… so perhaps we should focus on just the extreme BMI values using filter() function of thedplyr package. We will also use the ggrepel package to have our labels not overlap each other.
BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
ggplot(aes(x = Year, y = BMI)) +
geom_boxplot() +
geom_jitter(data=BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
filter(BMI>31 | BMI<19 | Country == "United States of America"),
aes(x =Year, y =BMI),
width = .02) +
ggrepel::geom_text_repel(data=BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
filter(BMI>31 | BMI<19 | Country == "United States of America"),
aes(x =Year, y =BMI, label = Country))And let’s fill the box plots with color and outline in black:
BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
ggplot(aes(x = Year, y = BMI)) +
geom_boxplot(outlier.shape = NA, color = "black", aes(fill = Year)) +
geom_jitter(data=BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
subset(BMI>31 | BMI<19 | Country == "United States of America"),
aes(x =Year, y =BMI),
width = .02) +
ggrepel::geom_text_repel(data=BMI_long %>%
filter(Sex == "Women", Year %in% c("1985", "2017"), Region == "Rural") %>%
subset(BMI>31 | BMI<19 | Country == "United States of America"),
aes(x =Year, y =BMI, label = Country))Let’s take a look at all the data together:
That’s useful, but let’s look at the individual points and include our country labels, to do so lets change our United States of America label to USA:
BMI_long$Country <-BMI_long$Country %>%str_replace( pattern = "United States of America", replacement = "USA")
ggplot(BMI_long, aes(x = Year, y = BMI, col = Year)) +
geom_boxplot(outlier.shape = NA, color = "black" , aes(fill = Year)) +
facet_grid(~Sex + Region) +
geom_hline(yintercept=30, linetype="dashed", color = "red", size =1) +
# This will add a horizontal dashed line to indicate BMI values considered to be in the range of obesity
geom_jitter(data=subset(BMI_long), aes(x =Year, y =BMI), width = .2, size =2, shape =21, color = "black", fill = "gray") +
# This will add the individual country data points
# The shape 21 allows for a different fill and outline color ( the outline is the "color")
# The width determines how wide the jitter points are plotted
geom_jitter(data=subset(BMI_long, Country == "USA"), aes(x =Year, y =BMI), width = .02, size =12, shape =21, color = "black", fill = "gray") +
# This will add points that are larger for the USA data
geom_text(data=subset(BMI_long, Country == "USA"), aes(x =Year, y =BMI,label=Country), color = "black") +
# This will add USA labels to the USA points
theme(legend.position = "none",
# This is useful for removing the legend
axis.text.x = element_text(size = 15,angle = 30),
# this changes the size and angle of the x axis point labels
axis.text.y = element_text(size = 20),
axis.title.y = element_text(size =15),
axis.title.x = element_text(size =15),
strip.text.x = element_text(size = 15)) +
# This changes the size of x axis labels for the facet
ggtitle( "Differences in BMI Over Time and Across Region Type and Gender") # Add a plot titleHere we can see that overall BMI appears to be increasing globally over time. Additionally we can see that this is occurring not just in urban areas, but also in rural areas. The US is consistently above the median in all strata of the data. In general, the female data shows higher BMI values than the male data. The rural USA BMI appears to be higher than the urban BMI for both men and women. Many countries have average BMI estimates above the obesity threshold of 30. Thus it appears that education and outreach programs for weight management should focus on both rural and urban areas and both genders. Education and assistance for women may be especially helpful.
How does the rate of increase in BMI differ between groups? Which group might especially need attention?
First let’s calculate the differences in BMI from 2017 and 1985 and add this to our BMI_long data object:
BMI_numeric$Rural_difference <- BMI_numeric$Rural_BMI_2017 - BMI_numeric$Rural_BMI_1985
BMI_numeric$Urban_difference <- BMI_numeric$Urban_BMI_2017 - BMI_numeric$Urban_BMI_1985
BMI_numeric$National_difference <- BMI_numeric$National_BMI_2017 - BMI_numeric$National_BMI_1985
BMI_diff_long <- BMI_numeric %>% select(Country: National_difference) %>% gather(key = Type, value = Difference , Rural_difference:National_difference)
head(BMI_diff_long)# A tibble: 6 x 4
Country Sex Type Difference
<fct> <chr> <chr> <dbl>
1 Afghanistan Men Rural_difference 2.8
2 Afghanistan Women Rural_difference 3.5
3 Albania Men Rural_difference 1.90
4 Albania Women Rural_difference 0.1000
5 Algeria Men Rural_difference 3
6 Algeria Women Rural_difference 3.70
Let’s replace “United states of America” with “USA” and make a plot with this data to compare the change in BMI:
BMI_diff_long$Country <-BMI_diff_long$Country %>%str_replace( pattern = "United States of America", replacement = "USA")
ggplot(BMI_diff_long, aes(x = Type, y = Difference, col = Type)) +
geom_boxplot(outlier.shape = NA, color = "black" , aes(fill = Type)) +
facet_grid(~Sex) +
geom_jitter(data=subset(BMI_diff_long), aes(x =Type, y =Difference), width = .2, size =2, shape =21, color = "black", fill = "gray") +
# This will add the individual country data points
# The shape 21 allows for a different fill and outline color ( the outline is the "color")
# The width determines how wide the jitter points are plotted
geom_jitter(data=subset(BMI_diff_long, Country == "USA"), aes(x =Type, y =Difference), width = .02, size =12, shape =21, color = "black", fill = "gray") +
# This will add points that are larger for the USA data
geom_text(data=subset(BMI_diff_long, Country == "USA"), aes(x =Type, y =Difference,label=Country), color = "black") +
# This will add USA labels to the USA points
theme(legend.position = "none",
# This is useful for removing the legend
axis.text.x = element_text(size = 15,angle = 30),
# this changes the size and angle of the x axis point labels
axis.text.y = element_text(size = 20),
axis.title.y = element_text(size =15),
axis.title.x = element_text(size =15),
strip.text.x = element_text(size = 15)) +
# this changes the size of x axis labels for the facet
ggtitle( "Change in BMI Over Time and Across Region Type and Gender" )# Add a plot title We can now see that the rate of change appears to be larger in the women compared to the men. The group with the largest increase in the USA is the women living in rural areas. Thus this group should be especially focused on to improve this public health issue.
Now let’s make a plot that summarizes the last two plots. To do this we will simplify the other two plots and then combine them together.
# simplified national means plot
Means_plot<-BMI_long %>%
filter(Sex %in% c("Men", "Women"), Year %in% c("1985", "2017"), Region == "National") %>%
ggplot(aes(x = Sex, y = BMI)) +
geom_boxplot(outlier.shape = NA, color = "black" , aes(fill = Sex)) +
scale_fill_manual(values=c("dodgerblue", "orchid2"))+
facet_grid(~Year) +
geom_hline(yintercept=30, linetype="dashed", color = "red", size =1) +
geom_jitter(data=BMI_long
%>%filter(Sex %in% c("Men", "Women"), Year %in% c("1985", "2017"), Region == "National"),
aes(x =Sex, y =BMI), width = .2, size =2, shape =21, color = "black", fill = "gray") +
geom_jitter(data=subset(BMI_long
%>%filter(Sex %in% c("Men", "Women"), Year %in% c("1985", "2017"), Region == "National"),
Country == "USA"), aes(x =Sex, y =BMI), width = .02, size =12, shape =21, color = "black", fill = "gray") +
# This will add points that are larger for the USA data
geom_text(data=subset(BMI_long
%>%filter(Sex %in% c("Men", "Women"), Year %in% c("1985", "2017"), Region == "National"),
Country == "USA"), aes(x =Sex, y =BMI,label=Country), color = "black") +
# This will add USA labels to the USA points
theme(legend.position = "none",
# This is useful for removing the legend
axis.text.x = element_text(size = 15,angle = 30),
# this changes the size and angle of the x axis point labels
axis.text.y = element_text(size = 20),
axis.title.y = element_text(size =15),
axis.title.x = element_text(size =15),
strip.text.x = element_text(size = 15))
#Simplified difference plot
BMI_diff_long$Type <-BMI_diff_long$Type %>%str_replace( pattern = "_difference", replacement = "")
Diff_plot<-BMI_diff_long %>%
filter(Sex %in% c("Men", "Women"), Type != "National") %>%
ggplot( aes(x = Type, y = Difference, col = Type)) +
geom_boxplot(outlier.shape = NA, color = "black" , aes(fill = Sex)) +
scale_fill_manual(values=c("dodgerblue", "orchid2"))+
facet_grid(~Sex) +
geom_jitter(data=BMI_diff_long %>%
filter(Sex %in% c("Men", "Women"), Type != "National") ,
aes(x =Type, y =Difference), width = .2, size =2, shape =21, color = "black", fill = "gray") +
geom_jitter(data=subset(BMI_diff_long %>%
filter(Sex %in% c("Men", "Women"), Type != "National") , Country == "USA"),
aes(x =Type, y =Difference), width = .02, size =12, shape =21, color = "black", fill = "gray") +
# This will add points that are larger for the USA data
geom_text(data=subset(BMI_diff_long %>%
filter(Sex %in% c("Men", "Women"), Type != "National") , Country == "USA"),
aes(x =Type, y =Difference,label=Country), color = "black") +
# This will add USA labels to the USA points
theme(legend.position = "none",
# This is useful for removing the legend
axis.text.x = element_text(size = 15,angle = 30),
# this changes the size and angle of the x axis point labels
axis.text.y = element_text(size = 20),
axis.title.y = element_text(size =15),
axis.title.x = element_text(size =15),
strip.text.x = element_text(size = 15)) # this changes the size of x axis labels for the facet
Diff_plot<-Diff_plot + labs(title = "Change in BMI by region", x = "", y = "Increase in BMI \n (1985 to 2017)")+
theme(title = element_text (size = 12, face = "bold"))
Means_plot <-Means_plot + labs(title = "Mean BMI over time", x = "", y = "Mean BMI")+
theme(title = element_text (size = 12, face = "bold"))
cowplot::plot_grid(Means_plot, Diff_plot, labels = c("A", "B"))Great, now we have put two plots together using the plot_grid() function of the cowplot package. This way we can clearly communicate two messages. The first being that BMI has increased over time globally and that many countries including the United States of America are approaching a mean BMI that is above the obesity threshold of 30. We can also see that Women have larger BMI values, but that both genders show increased levels over time. In the second plot we can see that the increase in BMI is not just happening in urban communities, but in both rural and urban communities. In fact, Women in rural areas of the United States of America appear to have the largest increase in BMI.
We have evaluated BMI average estimates from 200 different countries around the world. To do so we imported data from a pdf using the pdftools package. We used tidyverse packages such as dplyr, stringr, and tidy to clean the data and get it in a workable format. Our statistical analysis focused on evaluating differences in BMI in females around the world across time and between rural and urban areas. We found a significant difference both between years and between the type of community using t tests and nonparametric tests. Thus BMI has increased in women since 1985. Although BMI estimates are significantly higher in Urban areas compared to rural areas, BMI estimates have increased in both regions. Using the ggplot2 package we were able to visualize trends in the data. Importantly, the largest increase appears to be in the rural areas particularly for women. Analyses like this are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges.
Students can evaluate the change in BMI over time using the global data available for each year between 2015 and 2017. This data can be found here.